New Cardinality Estimation Methods for HyperLogLog Sketches

نویسنده

  • Otmar Ertl
چکیده

Œis work presents new cardinality estimation methods for data sets recorded by HyperLogLog sketches. A simple derivation of the original estimator was found, that also gives insight how to correct its de€ciencies. Œe result is an improved estimator that is unbiased over the full cardinality range, is easy computable, and does not rely on empirically determined data as previous approaches. Based on the maximum likelihood principle a second unbiased estimation method is presented which can also be extended to estimate cardinalities of union, intersection, or relative complements of two sets that are both represented as HyperLogLog sketches. Experimental results show that this approach is more precise than the conventional technique using the inclusion-exclusion principle.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

New cardinality estimation algorithms for HyperLogLog sketches

This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle a second unbiased method is derived together with a robust and efficient numerical algorithm to calculate the e...

متن کامل

Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm

We describe a new cardinality estimation algorithm that is extremely space-efficient. It applies one of three novel estimators to the compressed state of the Flajolet-Martin-85 coupon collection process. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, the new algorithm simultaneously wins on all three dimensions of the time/space/accuracy tradeoff. Our proto...

متن کامل

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the...

متن کامل

Efficient cardinality estimation for k-mers in large DNA sequencing data sets

We present an open implementation of the HyperLogLog cardinality estimation sketch for counting fixed-length substrings of DNA strings (“k-mers”). The HyperLogLog sketch implementation is in C++ with a Python interface, and is distributed as part of the khmer software package. khmer is freely available from https://github.com/dib-lab/khmer under a BSD License. The features presented here are in...

متن کامل

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

​ —The information presented in this paper defines LogLog-Beta (LogLog-β). LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities, therefore, it is more efficient and simpler to implement. Our simulations show that the accuracy provided by the new al...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1706.07290  شماره 

صفحات  -

تاریخ انتشار 2017